grpc: set MaxConcurrentStreams to avoid sudden traffic spikes that lead to PD OOM #8977
okJiang wants to merge 1 commit into tikv:master
Conversation
Signed-off-by: okJiang <819421878@qq.com>
Codecov Report ❌ Patch coverage is 57.14%. Your patch check has failed because the patch coverage (57.14%) is below the target coverage (74.00%). You can increase the patch coverage or adjust the target coverage. Additional details and impacted files:
@@ Coverage Diff @@
## master #8977 +/- ##
==========================================
+ Coverage 76.31% 77.61% +1.30%
==========================================
Files 465 532 +67
Lines 70547 94134 +23587
==========================================
+ Hits 53839 73065 +19226
- Misses 13361 17187 +3826
- Partials 3347 3882 +535
Flags with carried forward coverage won't be shown.
defaultGCTunerThreshold = 0.6
minGCTunerThreshold = 0
maxGCTunerThreshold = 0.9
// If concurrentStreams reaches 600k, the memory usage is about 40GB. To
Are there any tests to support this conclusion?
This conclusion comes from a real-world case where the cluster contained millions of regions and its requests were mainly ScanRegions.
CMIIW, it only limits the concurrency for a single connection, not the total concurrency on the server side.
In simple terms, should we treat this sudden surge in traffic as an anomaly? If so, I think we can set this parameter to protect PD. Do you know under what normal circumstances PD would experience such high traffic? For example, a 16 GB PD handling 160k requests.
// MaxConcurrentStreams specifies the maximum number of concurrent
// streams that each client can open at a time.
If users send massive numbers of requests through multiple clients, we have no way to limit them...
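For context, here is a minimal sketch of how this limit is typically installed on a grpc-go server. The listen address and the value 1024 are illustrative, not this PR's actual settings:

```go
package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
)

func main() {
	lis, err := net.Listen("tcp", ":2379") // illustrative address
	if err != nil {
		log.Fatal(err)
	}
	// MaxConcurrentStreams caps the number of HTTP/2 streams per client
	// connection. Each additional connection gets its own quota, which is
	// exactly the limitation noted above: multiple clients multiply the cap.
	srv := grpc.NewServer(grpc.MaxConcurrentStreams(1024))
	// ... register PD services here ...
	if err := srv.Serve(lis); err != nil {
		log.Fatal(err)
	}
}
```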
Does it apply to both unary and streaming RPCs?
Yes. This is a gRPC/HTTP2 transport-level limit, so it applies to both unary RPCs and streaming RPCs.
- Unary: each in-flight unary call occupies one stream until the call finishes.
- Stream: each open client/server/bidi stream occupies one stream for its whole lifetime.
- Scope: it limits concurrent streams per client connection (server transport), not the total concurrency across all clients.
So it can help on both paths, but for streaming it only limits the number of open streams, not the number of messages within one stream. I think this thread is about clarifying the behavior, so no code change is needed here.
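To make the per-connection scope concrete, here is a hypothetical client-side sketch. The address, the request count, and the use of the standard gRPC health service are illustrative only:

```go
package main

import (
	"context"
	"log"
	"sync"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	// One ClientConn == one HTTP/2 connection == one MaxConcurrentStreams quota.
	conn, err := grpc.Dial("127.0.0.1:2379",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	client := healthpb.NewHealthClient(conn)
	var wg sync.WaitGroup
	for i := 0; i < 10000; i++ { // far more calls than the server's stream cap
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each in-flight unary call holds one stream until it returns;
			// calls beyond the server's per-connection limit wait for a free
			// stream slot instead of piling up on the server.
			_, _ = client.Check(context.Background(), &healthpb.HealthCheckRequest{})
		}()
	}
	wg.Wait()
}
```

A second ClientConn would receive its own stream quota, which is why this option alone cannot bound total server-side concurrency across many clients.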
No more updates or comments; closing it now.
[APPROVALNOTIFIER] This PR is NOT APPROVED.
/cc @lhy1024 |
What problem does this PR solve?
Issue Number: Close #8882, ref #4480
What is changed and how does it work?
Added MaxConcurrentStreams to limit request concurrency. This is a self-protection mechanism for PD. Once the number of concurrent requests reaches the limit, gRPC waits for ongoing requests to finish before allocating resources to the waiting requests. As a result, requests may take longer, but PD gains better robustness.

PS: This parameter cannot be modified at runtime, so we cannot support changing it via pd-ctl. If you think it should be configurable, please comment.
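To illustrate the PS above, here is a sketch of why the knob is startup-only. The serverConfig type and newGRPCServer helper are hypothetical, not PD's actual code:

```go
package server

import "google.golang.org/grpc"

// serverConfig is a hypothetical configuration struct, read once at startup.
type serverConfig struct {
	MaxConcurrentStreams uint32
}

func newGRPCServer(cfg serverConfig) *grpc.Server {
	// grpc.MaxConcurrentStreams is a grpc.ServerOption, applied exactly once
	// when the server is constructed. Mutating cfg afterwards has no effect
	// on a running server, which is why pd-ctl cannot change it at runtime.
	return grpc.NewServer(grpc.MaxConcurrentStreams(cfg.MaxConcurrentStreams))
}
```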
Check List
Tests
Release note